Abstract

Most object detection methods operate by applying a binary classifier to sub-windows of an image, followed by a non-maximum suppression step where detections on overlapping sub-windows are removed. Since the number of possible sub-windows in even moderately sized image datasets is extremely large, the classifier is typically learned from only a subset of the windows. This avoids the computational difficulty of dealing with the entire set of sub-windows; however, as we will show in this paper, it leads to sub-optimal detector performance.

In particular, the main contribution of this paper is the introduction of a new method, Max-Margin Object Detection (MMOD), for learning to detect objects in images. This method does not perform any sub-sampling, but instead optimizes over all sub-windows. MMOD can be used to improve any object detection method which is linear in the learned parameters, such as HOG or bag-of-visual-word models. Using this approach we show substantial performance gains on three publicly available datasets. Strikingly, we show that a single rigid HOG filter can outperform a state-of-the-art deformable part model on the Face Detection Data Set and Benchmark when the HOG filter is learned via MMOD.


1. Introduction 

Detecting the presence and position of objects in an image is a fundamental task in computer vision. For example, tracking humans in video or performing scene understanding on a still image requires the ability to reason about the number and position of objects. While great progress has been made in recent years in terms of feature sets, the basic training procedure has remained the same. In this procedure, a set of positive and negative image windows is selected from training images. Then a binary classifier is trained on these windows. Lastly, the classifier is tested on images containing no targets of interest, and false alarm windows are identified and added into the training set. The classifier is then retrained and, optionally, this process is iterated.


This approach does not make efficient use of the available training data since it trains on only a subset of image windows. Additionally, windows partially overlapping an object are a common source of false alarms. This training procedure makes it difficult to directly incorporate these examples into the training set since these windows are neither fully a false alarm nor a true detection. Most importantly, the accuracy of the object detection system as a whole is not optimized. Instead, the accuracy of a binary classifier on the subsampled training set is used as a proxy.

In this work, we show how to address all of these issues. In particular, we will show how to design an optimizer that runs over all windows and optimizes the performance of an object detection system in terms of the number of missed detections and false alarms in the final system output. Moreover, our formulation leads to a convex optimization and we provide an algorithm which finds the globally optimal set of parameters. Finally, we test our method on three publicly available datasets and show that it substantially improves the accuracy of the learned detectors. Strikingly, we find that a single rigid HOG filter can outperform a state-of-the-art deformable part model if the HOG filter is learned via MMOD.

2. Related Work 

In their seminal work, Dalal and Triggs introduced the Histogram of Oriented Gradients (HOG) feature for detecting pedestrians within a sliding window framework [3]. Subsequent object detection research has focused primarily on finding improved representations. Many recent approaches include features for part-based modeling, methods for combining local features, or dimensionality reduction. All these methods employ some form of binary classifier trained on positive and negative image windows.

In contrast, Blaschko and Lampert’s research into structured output regression is the most similar to our own. As with our approach, they use a structural support vector machine formulation, which allows them to train on all window locations. However, their training procedure assumes an image contains either 0 or 1 objects, whereas in the present work we show how to treat object detection in the general setting where an image may contain any number of objects.

Algorithm 1 Object Detection
Input: image $x$, window scoring function $f$
1: $\mathcal{D}$ := all rectangles $r \in \mathcal{R}$ such that $f(x, r) > 0$
2: Sort $\mathcal{D}$ such that $\mathcal{D}_1 \ge \mathcal{D}_2 \ge \mathcal{D}_3 \ge \dots$
3: $y^* := \{\}$
4: for $i = 1$ to $|\mathcal{D}|$ do
5:     if $\mathcal{D}_i$ does not overlap any rectangle in $y^*$ then
6:         $y^* := y^* \cup \{\mathcal{D}_i\}$
7:     end if
8: end for
9: Return: $y^*$  ▷ The detected object positions.

Figure 1. Three sliding windows and their $f$ scores. Assume non-max suppression rejects any rectangles which touch. Then the optimal detector would select the two outside rectangles, giving a total score of 12, while a greedy detector selects the center rectangle for a total score of only 7.


3. Problem Definition 

In what follows, we will use $r$ to denote a rectangular area of an image. Additionally, let $\mathcal{R}$ denote the set of all rectangular areas scanned by our object detection system. To incorporate the common non-maximum suppression practice, we define a valid labeling of an image as a subset of $\mathcal{R}$ such that no two elements of the labeling overlap. We use the following popular definition of “does not overlap”: rectangles $r_1$ and $r_2$ do not overlap if the ratio of their intersection area to total area covered is less than 0.5. That is,

\[
\frac{\mathrm{Area}(r_1 \cap r_2)}{\mathrm{Area}(r_1 \cup r_2)} < 0.5. \tag{1}
\]


Finally, we use $\mathcal{Y}$ to denote the set of all valid labelings.

Then, given an image $x$ and a window scoring function $f$, we can define the object detection procedure as

\[
y^* = \operatorname*{argmax}_{y \in \mathcal{Y}} \sum_{r \in y} f(x, r). \tag{2}
\]

That is, find the set of sliding window positions which have the largest scores but simultaneously do not overlap. This is typically accomplished with the greedy peak sorting method shown in Algorithm 1. An ideal learning algorithm would find the window scoring function which jointly minimizes the number of false alarms and missed detections produced when used in Algorithm 1.

It should be noted that solving Equation (2) exactly is not computationally feasible. Thus, Algorithm 1 is only a greedy approximation and does not always find the optimal solution to Equation (2). An example which leads to suboptimal results is shown in Figure 1. However, as we will see, this suboptimal behavior does not lead to difficulties. Moreover, in the next section, we give an optimization algorithm capable of finding an appropriate window scoring function for use with Algorithm 1.
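The greedy procedure of Algorithm 1 and the overlap test of Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `(left, top, right, bottom)` rectangle layout, the helper names, and the `touches` rule used to mimic Figure 1 are assumptions, as are the outer scores of 6 (the text only states the totals 12 and 7).

```python
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom); assumed layout


def area(r: Rect) -> float:
    """Area of a rectangle; inverted (empty) rectangles count as zero."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou_overlap(r1: Rect, r2: Rect) -> bool:
    """Negation of Equation (1): overlap when intersection / union >= 0.5."""
    inter = area((max(r1[0], r2[0]), max(r1[1], r2[1]),
                  min(r1[2], r2[2]), min(r1[3], r2[3])))
    union = area(r1) + area(r2) - inter
    return union > 0 and inter / union >= 0.5


def touches(r1: Rect, r2: Rect) -> bool:
    """Stricter rule assumed for Figure 1: rectangles overlap if they touch at all."""
    return not (r1[2] < r2[0] or r2[2] < r1[0] or r1[3] < r2[1] or r2[3] < r1[1])


def greedy_detect(scored: List[Tuple[Rect, float]],
                  overlaps: Callable[[Rect, Rect], bool] = iou_overlap) -> List[Rect]:
    """Algorithm 1: visit positive-scoring windows in descending score order and
    keep each one that does not overlap a window already kept."""
    candidates = sorted((s for s in scored if s[1] > 0),
                        key=lambda s: s[1], reverse=True)
    kept: List[Rect] = []
    for rect, _ in candidates:
        if not any(overlaps(rect, k) for k in kept):
            kept.append(rect)
    return kept


# Figure 1 style example: the center window touches both outer windows.
left, center, right = (0, 0, 2, 2), (1.5, 0, 3.5, 2), (3, 0, 5, 2)
figure1 = greedy_detect([(left, 6.0), (center, 7.0), (right, 6.0)], touches)
```

Under the touching rule the greedy pass keeps only the center window (total score 7), even though the two outer windows together would score 12, which is the suboptimality the text describes.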


4. Max-Margin Object Detection 

In this work, we consider only window scoring functions which are linear in their parameters. In particular, we use functions of the form

\[
f(x, r) = \langle w, \phi(x, r) \rangle \tag{3}
\]

where $\phi$ extracts a feature vector from the sliding window location $r$ in image $x$, and $w$ is a parameter vector. If we denote the sum of window scores for a set of rectangles, $y$, as $F(x, y)$, then Equation (2) becomes

\[
y^* = \operatorname*{argmax}_{y \in \mathcal{Y}} F(x, y) = \operatorname*{argmax}_{y \in \mathcal{Y}} \sum_{r \in y} \langle w, \phi(x, r) \rangle. \tag{4}
\]

Then we seek a parameter vector $w$ which leads to the fewest possible detection mistakes. That is, given a randomly selected image and label pair $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, we would like the score for the correct labeling of $x_i$ to be larger than the scores for all the incorrect labelings. Therefore,

\[
F(x_i, y_i) > \max_{y \ne y_i} F(x_i, y) \tag{5}
\]

should be satisfied as often as possible.

4.1. The Objective Function for Max-Margin Object Detection

Our algorithm takes a set of images $\{x_1, x_2, \dots, x_n\} \subset \mathcal{X}$ and associated labels $\{y_1, y_2, \dots, y_n\} \subset \mathcal{Y}$ and attempts to find a $w$ such that the detector makes the correct prediction on each training sample. We take a max-margin approach and require that the label for each training sample is correctly predicted with a large margin. This leads to the following convex optimization problem:

\[
\begin{aligned}
\min_{w} \quad & \frac{1}{2}\|w\|^2 \\
\text{s.t.} \quad & F(x_i, y_i) \ge \max_{y \in \mathcal{Y}} \left[ F(x_i, y) + \Delta(y, y_i) \right], \quad \forall i
\end{aligned}
\tag{6}
\]

where $\Delta(y, y_i)$ denotes the loss for predicting a labeling of $y$ when the true labeling is $y_i$. In particular, we define the loss as

\[
\Delta(y, y_i) = L_{miss} \cdot (\#\text{ of missed detections}) + L_{fa} \cdot (\#\text{ of false alarms}) \tag{7}
\]

where $L_{miss}$ and $L_{fa}$ control the relative importance of achieving high recall and high precision, respectively.
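The loss of Equation (7) can be illustrated with a short Python sketch. The counting rule here is an assumption made for the example: a predicted rectangle hits a truth rectangle when their intersection-over-union is at least 0.5 (the complement of Equation (1)), and a second hit on an already-matched truth rectangle is counted as a false alarm. The rectangle layout and function names are hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom); assumed layout


def iou(r1: Rect, r2: Rect) -> float:
    """Intersection area divided by union area, the ratio in Equation (1)."""
    iw = max(0.0, min(r1[2], r2[2]) - max(r1[0], r2[0]))
    ih = max(0.0, min(r1[3], r2[3]) - max(r1[1], r2[1]))
    inter = iw * ih
    union = ((r1[2] - r1[0]) * (r1[3] - r1[1])
             + (r2[2] - r2[0]) * (r2[3] - r2[1]) - inter)
    return inter / union if union > 0 else 0.0


def detection_loss(pred: List[Rect], truth: List[Rect],
                   l_miss: float = 1.0, l_fa: float = 1.0) -> float:
    """Equation (7): l_miss * (#missed detections) + l_fa * (#false alarms)."""
    hit = [False] * len(truth)
    false_alarms = 0
    for p in pred:
        best: Optional[int] = None
        best_iou = 0.5  # assumed matching threshold
        for j, t in enumerate(truth):
            v = iou(p, t)
            if v >= best_iou:
                best, best_iou = j, v
        if best is None or hit[best]:
            false_alarms += 1  # unmatched prediction or duplicate detection
        else:
            hit[best] = True
    missed = hit.count(False)
    return l_miss * missed + l_fa * false_alarms
```

With $L_{miss} = 2$ and $L_{fa} = 1$, an empty prediction on an image with one object costs 2, while a prediction containing only one stray window costs 3 (one miss plus one false alarm).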

Equation (6) is a hard-margin formulation of our learning problem. Since real-world data is often noisy, not perfectly separable, or contains outliers, we extend this into the soft-margin setting. In doing so, we arrive at the defining optimization for Max-Margin Object Detection (MMOD):

\[
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & F(x_i, y_i) \ge \max_{y \in \mathcal{Y}} \left[ F(x_i, y) + \Delta(y, y_i) \right] - \xi_i, \quad \forall i \\
& \xi_i \ge 0, \quad \forall i
\end{aligned}
\tag{8}
\]

In this setting, $C$ is analogous to the usual support vector machine parameter and controls the trade-off between fitting the training data and obtaining a large margin.

Insight into this formulation can be gained by noting that each $\xi_i$ is an upper bound on the loss incurred by training example $(x_i, y_i)$. This can be seen as follows (let $g(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} F(x, y)$):

\[
\xi_i \ge \max_{y \in \mathcal{Y}} \left[ F(x_i, y) + \Delta(y, y_i) \right] - F(x_i, y_i) \tag{9}
\]
\[
\xi_i \ge \left[ F(x_i, g(x_i)) + \Delta(g(x_i), y_i) \right] - F(x_i, y_i) \tag{10}
\]
\[
\xi_i \ge \Delta(g(x_i), y_i) \tag{11}
\]

In the step from (9) to (10) we replace the max over $\mathcal{Y}$ with a particular element, $g(x_i)$. Therefore, the inequality continues to hold. In going from (10) to (11) we note that $F(x_i, g(x_i)) - F(x_i, y_i) \ge 0$ since $g(x_i)$ is by definition the element of $\mathcal{Y}$ which maximizes $F(x_i, \cdot)$.

Therefore, the MMOD objective function defined by Equation (8) is a convex upper bound on the average loss per training image

\[
\frac{C}{n}\sum_{i=1}^{n} \Delta\left(\operatorname*{argmax}_{y \in \mathcal{Y}} F(x_i, y),\; y_i\right). \tag{12}
\]

This means that, for example, if $\xi_i$ from Equation (8) is driven to zero then the detector is guaranteed to produce the correct output for the corresponding training example.

This type of max-margin approach has been used successfully in a number of other domains. An example is the Hidden Markov SVM, which gives state-of-the-art results on sequence labeling tasks. Other examples include multiclass SVMs and methods for learning probabilistic context free grammars [10].

4.2. Solving the MMOD Optimization Problem

We use the cutting plane method [9, 17] to solve the Max-Margin Object Detection optimization problem defined by Equation (8). Note that MMOD is equivalent to the following unconstrained problem

\[
\min_{w} J(w) = \frac{1}{2}\|w\|^2 + R_{emp}(w) \tag{13}
\]

where $R_{emp}(w)$ is

\[
\frac{C}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left[ F(x_i, y) + \Delta(y, y_i) - F(x_i, y_i) \right]. \tag{14}
\]

Further, note that $R_{emp}$ is a convex function of $w$ and therefore is lower bounded by any tangent plane. The cutting plane method exploits this to find the minimizer of $J$. It does this by building a progressively more accurate lower bounding approximation constructed from tangent planes. Each step of the algorithm finds a new $w$ minimizing this approximation. Then it obtains the tangent plane to $R_{emp}$ at $w$, and incorporates this new plane into the lower bounding function, tightening the approximation. A sketch of the procedure is shown in Figure 2.

Figure 2. The red curve is lower bounded by its tangent planes. Adding the tangent plane depicted by the green line tightens the lower bound further.

Let $\partial R_{emp}(w_t)$ denote the subgradient of $R_{emp}$ at a point $w_t$. Then a tangent plane to $R_{emp}$ at $w_t$ is given by

\[
\langle w, a \rangle + b \tag{15}
\]

where

\[
a \in \partial R_{emp}(w_t) \tag{16}
\]
\[
b = R_{emp}(w_t) - \langle w_t, a \rangle. \tag{17}
\]

Given these considerations, the lower bounding approximation we use is

\[
\frac{1}{2}\|w\|^2 + R_{emp}(w) \ge \frac{1}{2}\|w\|^2 + \max_{(a, b) \in P} \left[ \langle w, a \rangle + b \right] \tag{18}
\]

where $P$ is the set of lower bounding planes, i.e. the “cutting planes”.

Pseudocode for this method is shown in Algorithm 2. It executes until the gap between the true MMOD objective function and the lower bound is less than $\epsilon$. This guarantees convergence to the optimal $w^*$ to within $\epsilon$. That is, we will have

\[
|J(w^*) - J(w_t)| < \epsilon \tag{19}
\]

upon termination of Algorithm 2.

Algorithm 2 MMOD Optimizer
Input: $\epsilon > 0$
1: $w_0 := 0$, $t := 0$, $P_0 := \{\}$
2: repeat
3:     $t := t + 1$
4:     Compute plane tangent to $R_{emp}(w_{t-1})$, select $a_t \in \partial R_{emp}(w_{t-1})$ and $b_t := R_{emp}(w_{t-1}) - \langle w_{t-1}, a_t \rangle$
5:     $P_t := P_{t-1} \cup \{(a_t, b_t)\}$
6:     Let $K_t(w) = \frac{1}{2}\|w\|^2 + \max_{(a_i, b_i) \in P_t} \left[ \langle w, a_i \rangle + b_i \right]$
7:     $w_t := \operatorname*{argmin}_w K_t(w)$
8: until $\frac{1}{2}\|w_t\|^2 + R_{emp}(w_t) - K_t(w_t) < \epsilon$
9: Return: $w_t$
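The cutting plane loop can be illustrated on a one-dimensional toy problem. The sketch below is only an assumption-laden illustration: the empirical risk is a single hinge term $R_{emp}(w) = C \max(0, 1 - w)$ rather than the MMOD risk, and the argmin over the piecewise model is found by ternary search (valid because the model is convex in one dimension) instead of a quadratic program solver. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def cutting_plane_1d(remp: Callable[[float], float],
                     subgrad: Callable[[float], float],
                     eps: float = 1e-3,
                     lo: float = -10.0, hi: float = 10.0,
                     max_iter: int = 100) -> Tuple[float, int]:
    """Minimize J(w) = 0.5 * w**2 + remp(w) for scalar w with cutting planes."""
    planes: List[Tuple[float, float]] = []  # (a, b) with remp(w) >= a * w + b
    w = 0.0
    for _ in range(max_iter):
        # Tangent plane at the current point, Equations (15)-(17).
        a = subgrad(w)
        planes.append((a, remp(w) - a * w))

        def model(v: float) -> float:
            # Lower bounding approximation of J built from all planes so far.
            return 0.5 * v * v + max(ai * v + bi for ai, bi in planes)

        # Minimize the convex 1-D model by ternary search (stand-in for a QP solver).
        left, right = lo, hi
        for _ in range(200):
            m1 = left + (right - left) / 3.0
            m2 = right - (right - left) / 3.0
            if model(m1) <= model(m2):
                right = m2
            else:
                left = m1
        w = 0.5 * (left + right)

        # Stop when the true objective and its lower bound nearly agree.
        gap = 0.5 * w * w + remp(w) - model(w)
        if gap < eps:
            break
    return w, len(planes)


C = 2.0  # toy trade-off parameter


def toy_remp(w: float) -> float:
    return C * max(0.0, 1.0 - w)


def toy_subgrad(w: float) -> float:
    return -C if w < 1.0 else 0.0


w_opt, n_planes = cutting_plane_1d(toy_remp, toy_subgrad)
```

For this toy risk the first plane alone places the model minimum at $w = 2$; the second plane, taken at that point, moves the minimum to the kink at $w = 1$, where the gap closes after two planes.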

4.2.1 Solving the Quadratic Programming Subproblem

A key step of Algorithm 2 is solving the argmin on step 7. This subproblem can be written as a quadratic program and solved efficiently using standard methods. Therefore, in this section we derive a simple quadratic program solver for this problem. We begin by writing step 7 as a quadratic program and obtain

\[
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|^2 + \xi \\
\text{s.t.} \quad & \xi \ge \langle w, a_i \rangle + b_i, \quad \forall (a_i, b_i) \in P.
\end{aligned}
\tag{20}
\]

The set of variables being optimized, $w$, will typically have many more dimensions than the number of constraints in the above problem. Therefore, it is more efficient to solve the dual problem. To do this, note that the Lagrangian is

\[
L(w, \xi, \lambda) = \frac{1}{2}\|w\|^2 + \xi - \sum_{i=1}^{|P|} \lambda_i \left( \xi - \langle w, a_i \rangle - b_i \right) \tag{21}
\]

and so the dual of the quadratic program is

\[
\begin{aligned}
\max_{w, \xi, \lambda} \quad & L(w, \xi, \lambda) \\
\text{s.t.} \quad & \nabla_w L(w, \xi, \lambda) = 0, \\
& \nabla_\xi L(w, \xi, \lambda) = 0, \\
& \lambda_i \ge 0, \quad \forall i.
\end{aligned}
\tag{22}
\]

Algorithm 3 Quadratic Program Solver for Equation (23)
Input: $Q$, $b$, $\lambda$, $\epsilon_{qp} > 0$
1: $\tau = 10^{-10}$
2: repeat
3:     $\nabla := Q\lambda - b$
4:     $big := -\infty$
5:     $little := \infty$
6:     $l := 0$
7:     $b := 0$
8:     for $i = 1$ to $|\nabla|$ do
9:         if $\nabla_i > big$ and $\lambda_i > 0$ then
10:            $big := \nabla_i$
11:            $b := i$
12:        end if
13:        if $\nabla_i < little$ then
14:            $little := \nabla_i$
15:            $l := i$
16:        end if
17:    end for
18:    $gap := \lambda^T \nabla - little$
19:    $z := \lambda_b + \lambda_l$
20:    $x := \max(\tau, Q_{bb} + Q_{ll} - 2Q_{bl})$
21:    $\lambda_b := \lambda_b - (big - little)/x$
22:    $\lambda_l := \lambda_l + (big - little)/x$
23:    if $\lambda_b < 0$ then
24:        $\lambda_b := 0$
25:        $\lambda_l := z$
26:    end if
27: until $gap < \epsilon_{qp}$
28: Return: $\lambda$


After a little algebra, the dual reduces to the following quadratic program,

\[
\begin{aligned}
\max_{\lambda} \quad & \lambda^T b - \frac{1}{2}\lambda^T Q \lambda \\
\text{s.t.} \quad & \lambda_i \ge 0, \quad \sum_{i=1}^{|P|} \lambda_i = 1
\end{aligned}
\tag{23}
\]

where $\lambda$ and $b$ are column vectors of the variables $\lambda_i$ and $b_i$ respectively and $Q_{ij} = \langle a_i, a_j \rangle$.

We use a simplified variant of Platt’s sequential minimal optimization method to solve the dual quadratic program of Equation (23). Algorithm 3 contains the pseudocode. In each iteration, the pair of Lagrange multipliers $(\lambda_b, \lambda_l)$ which most violates the KKT conditions is selected (lines 4-17). Then the selected pair is jointly optimized (lines 19-26). The solver terminates when the duality gap is less than a threshold.

Upon solving for the optimal $\lambda^*$, the $w_t$ needed by step 7 of Algorithm 2 is given by

\[
w_t = -\sum_{i=1}^{|P|} \lambda_i^* a_i. \tag{24}
\]

The value of $\min_w K(w)$ needed for the test for convergence can be conveniently computed as

\[
\lambda^{*T} b - \frac{1}{2}\|w_t\|^2. \tag{25}
\]

Additionally, there are a number of non-essential but useful implementation tricks. In particular, the starting $\lambda$ should be initialized using the $\lambda$ from the previous iteration of the MMOD optimizer. Also, cutting planes typically become inactive after a small number of iterations and can be safely removed. A cutting plane is inactive if its associated Lagrange multiplier is 0. Our implementation removes a cutting plane if it has been inactive for 20 iterations.
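The pairwise update of Algorithm 3, together with the recovery of $w_t$ and of the model value from Equations (24) and (25), can be sketched in plain Python as follows. This is a simplified illustration, not the authors' code: it tests the duality gap before each update rather than after it, caps the iteration count, and uses list-of-lists matrices; the function names are hypothetical.

```python
from typing import List, Sequence


def solve_qp_smo(Q: Sequence[Sequence[float]], b: Sequence[float],
                 lam: Sequence[float], eps_qp: float = 1e-9,
                 tau: float = 1e-10, max_iter: int = 10000) -> List[float]:
    """Maximize lam.b - 0.5 * lam^T Q lam subject to lam >= 0 and sum(lam) = 1.

    `lam` must be a feasible starting point (non-negative, summing to one)."""
    n = len(b)
    lam = list(lam)
    for _ in range(max_iter):
        grad = [sum(Q[i][j] * lam[j] for j in range(n)) - b[i] for i in range(n)]
        big, little = float("-inf"), float("inf")
        i_big = i_little = 0
        for i in range(n):
            if grad[i] > big and lam[i] > 0:   # worst coordinate that still has mass
                big, i_big = grad[i], i
            if grad[i] < little:               # best coordinate to move mass to
                little, i_little = grad[i], i
        gap = sum(l * g for l, g in zip(lam, grad)) - little
        if gap < eps_qp:
            break
        z = lam[i_big] + lam[i_little]
        x = max(tau, Q[i_big][i_big] + Q[i_little][i_little] - 2 * Q[i_big][i_little])
        step = (big - little) / x
        lam[i_big] -= step
        lam[i_little] += step
        if lam[i_big] < 0:                     # clip back onto the simplex
            lam[i_big], lam[i_little] = 0.0, z
    return lam


def recover_w(lam: Sequence[float], planes_a: Sequence[Sequence[float]]) -> List[float]:
    """Equation (24): w = -sum_i lam_i * a_i."""
    dim = len(planes_a[0])
    return [-sum(l * a[k] for l, a in zip(lam, planes_a)) for k in range(dim)]


def model_minimum(lam: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Equation (25): lam^T b - 0.5 * ||w||^2."""
    return sum(l * bi for l, bi in zip(lam, b)) - 0.5 * sum(v * v for v in w)


# Two orthogonal cutting planes a_1 = (1, 0), a_2 = (0, 1) with zero offsets.
A = [[1.0, 0.0], [0.0, 1.0]]
Q = [[sum(p * q for p, q in zip(ai, aj)) for aj in A] for ai in A]
lam_star = solve_qp_smo(Q, [0.0, 0.0], [1.0, 0.0])
w_t = recover_w(lam_star, A)
```

In this example the multipliers end up split evenly between the two planes, and the recovered $w_t$ attains the same model value whether it is computed from Equation (25) or directly as $\frac{1}{2}\|w\|^2 + \max_i \langle w, a_i \rangle$.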

4.2.2 Computing $R_{emp}$ and its Subgradient

The final component of our algorithm is a method for computing $R_{emp}$ and an element of its subgradient. Recall that $F(x, y)$ and $R_{emp}$ are

\[
F(x, y) = \sum_{r \in y} \langle w, \phi(x, r) \rangle \tag{26}
\]
\[
R_{emp}(w) = \frac{C}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left[ F(x_i, y) + \Delta(y, y_i) - F(x_i, y_i) \right]. \tag{27}
\]

Then an element of the subgradient of $R_{emp}$ is

\[
\partial R_{emp}(w) = \frac{C}{n}\sum_{i=1}^{n} \left[ \sum_{r \in y_i^*} \phi(x_i, r) - \sum_{r \in y_i} \phi(x_i, r) \right] \tag{28}
\]

where

\[
y_i^* = \operatorname*{argmax}_{y \in \mathcal{Y}} \left[ \Delta(y, y_i) + \sum_{r \in y} \langle w, \phi(x_i, r) \rangle \right]. \tag{29}
\]

Our method for computing $y_i^*$ is shown in Algorithm 4. It is a modification of the normal object detection procedure from Algorithm 1 to solve Equation (29) rather than Equation (2). Therefore, the task of Algorithm 4 is to find the set of rectangles which jointly maximize the total detection score and loss.
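Once the loss-augmented labelings $y_i^*$ are known, Equation (28) is a difference of summed feature vectors. The following minimal sketch assumes that the features of each window have already been extracted into plain lists; the function name and argument layout are hypothetical.

```python
from typing import List, Sequence


def remp_subgradient(aug_feats: Sequence[Sequence[Sequence[float]]],
                     true_feats: Sequence[Sequence[Sequence[float]]],
                     C: float, dim: int) -> List[float]:
    """Equation (28).

    aug_feats[i]  holds phi(x_i, r) for every r in the loss-augmented labeling y_i^*.
    true_feats[i] holds phi(x_i, r) for every r in the true labeling y_i.
    """
    n = len(true_feats)
    total = [0.0] * dim
    for y_star, y_true in zip(aug_feats, true_feats):
        for phi in y_star:
            for k in range(dim):
                total[k] += phi[k]
        for phi in y_true:
            for k in range(dim):
                total[k] -= phi[k]
    return [C / n * v for v in total]


# Two toy images with two-dimensional features.
a = remp_subgradient(
    aug_feats=[[[1.0, 0.0]], []],            # image 2: y^* is the empty labeling
    true_feats=[[[0.0, 1.0]], [[2.0, 2.0]]],
    C=2.0, dim=2)
```

The vector `a` is what Algorithm 2 uses as $a_t$; the matching offset $b_t$ then follows from Equation (17).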

There are two cases to consider. First, if a rectangle does not hit any truth rectangles, then it contributes positively to the argmax in Equation (29) whenever its score plus the loss per false alarm ($L_{fa}$) is positive. Second, if a rectangle hits a truth rectangle then we reason as follows: if we reject the first rectangle which matches a truth rectangle then, since the rectangles are sorted in descending order of score, we will reject all others which match it as well. This outcome results in a single value of $L_{miss}$. Alternatively, if we accept the first rectangle which matches a truth rectangle then we gain its detection score. Additionally, we may also obtain additional scores from subsequent duplicate detections, each of which contributes the value of its window scoring function plus $L_{fa}$. Therefore, Algorithm 4 computes the total score for the accept case and checks it against $L_{miss}$. It then selects the result with the largest value. In the pseudocode, these scores are accumulated in the $s_r$ variables.

Algorithm 4 Loss Augmented Detection
Input: image $x$, true object positions $y$, weight vector $w$, $L_{miss}$, $L_{fa}$
1: $\mathcal{D}$ := all rectangles $r \in \mathcal{R}$ such that $\langle w, \phi(x, r) \rangle + L_{fa} > 0$
2: Sort $\mathcal{D}$ such that $\mathcal{D}_1 \ge \mathcal{D}_2 \ge \mathcal{D}_3 \ge \dots$
3: $s_r := 0$, $h_r := false$, $\forall r \in y$
4: for $i = 1$ to $|\mathcal{D}|$ do
5:     if $\mathcal{D}_i$ does not overlap $\{\mathcal{D}_{i-1}, \mathcal{D}_{i-2}, \dots\}$ then
6:         if $\mathcal{D}_i$ matches an element of $y$ then
7:             $r$ := best matching element of $y$
8:             if $h_r = false$ then
9:                 $s_r := \langle w, \phi(x, \mathcal{D}_i) \rangle$
10:                $h_r := true$
11:            else
12:                $s_r := s_r + \langle w, \phi(x, \mathcal{D}_i) \rangle + L_{fa}$
13:            end if
14:        end if
15:    end if
16: end for
17: $y^* := \{\}$
18: for $i = 1$ to $|\mathcal{D}|$ do
19:     if $\mathcal{D}_i$ does not overlap $y^*$ then
20:         if $\mathcal{D}_i$ matches an element of $y$ then
21:             $r$ := best matching element of $y$
22:             if $s_r > L_{miss}$ then
23:                 $y^* := y^* \cup \{\mathcal{D}_i\}$
24:             end if
25:         else
26:             $y^* := y^* \cup \{\mathcal{D}_i\}$
27:         end if
28:     end if
29: end for
30: Return: $y^*$  ▷ The detected object positions.


This algorithm is greedy and thus may fail to find the optimal $y^*$ according to Equation (29). However, it is greedy in much the same way as the detection method of Algorithm 1. Moreover, since our goal from the outset is to find a set of parameters which makes Algorithm 1 perform well, we should use a training procedure which respects the properties of the algorithm being optimized. For example, if the correct output in the case of Figure 1 was to select the two boxes on the sides, then Algorithm 1 would make a mistake while a method which was optimal would not. Therefore, it is important for the learning procedure to account for this and learn that in such situations, if Algorithm 1 is to produce the correct output, the side rectangles need a larger score than the middle rectangle. Ultimately, it is only necessary for Algorithm 4 to give a value of $R_{emp}$ which upper bounds the average loss per training image. In our experiments, we always observed this to be the case.
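The two-pass structure of Algorithm 4 can be sketched in Python as below. This is an illustrative reading of the pseudocode, not the authors' implementation: window scores are passed in precomputed, a window "matches" a truth rectangle when their intersection-over-union is at least 0.5, and the rectangle layout and names are assumptions.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom); assumed layout


def iou(r1: Rect, r2: Rect) -> float:
    """Intersection area divided by union area, the ratio in Equation (1)."""
    iw = max(0.0, min(r1[2], r2[2]) - max(r1[0], r2[0]))
    ih = max(0.0, min(r1[3], r2[3]) - max(r1[1], r2[1]))
    inter = iw * ih
    union = ((r1[2] - r1[0]) * (r1[3] - r1[1])
             + (r2[2] - r2[0]) * (r2[3] - r2[1]) - inter)
    return inter / union if union > 0 else 0.0


def loss_augmented_detect(scored: List[Tuple[Rect, float]], truth: List[Rect],
                          l_miss: float, l_fa: float,
                          thresh: float = 0.5) -> List[Rect]:
    """Greedy approximation of Equation (29); `scored` holds (window, <w, phi>) pairs."""
    cand = sorted((s for s in scored if s[1] + l_fa > 0),
                  key=lambda s: s[1], reverse=True)

    def best_match(rect: Rect) -> Optional[int]:
        best, best_v = None, thresh
        for j, t in enumerate(truth):
            v = iou(rect, t)
            if v >= best_v:
                best, best_v = j, v
        return best

    # Pass 1 (lines 3-16): total score gained per truth box if it is "accepted".
    s = [0.0] * len(truth)
    hit = [False] * len(truth)
    for i, (rect, score) in enumerate(cand):
        if any(iou(rect, prev) >= thresh for prev, _ in cand[:i]):
            continue
        j = best_match(rect)
        if j is None:
            continue
        if not hit[j]:
            s[j], hit[j] = score, True
        else:
            s[j] += score + l_fa  # duplicate detection also earns the false alarm loss

    # Pass 2 (lines 17-29): keep matched windows only if accepting beats l_miss.
    out: List[Rect] = []
    for rect, _ in cand:
        if any(iou(rect, k) >= thresh for k in out):
            continue
        j = best_match(rect)
        if j is None or s[j] > l_miss:
            out.append(rect)
    return out
```

For one truth box, $L_{miss} = 2$, and $L_{fa} = 1$: a window on the object scoring 1.0 is rejected (missing it is worth 2 to the augmented objective), while a stray window scoring -0.5 is kept because -0.5 + 1 > 0.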

5. Experimental Results 

To test the effectiveness of MMOD, we evaluate it on the TU Darmstadt cows, INRIA pedestrians, and FDDB datasets. When evaluating on the first two datasets, we use the same feature extraction ($\phi$) and parameter settings ($C$, $\epsilon$, $\epsilon_{qp}$, $L_{fa}$, and $L_{miss}$), which are set as follows: $C = 25$, $\epsilon = 0.15C$, $L_{fa} = 1$, and $L_{miss} = 2$. This value of $\epsilon$ means the optimization runs until the potential improvement in average loss per training example is less than 0.15. For the QP subproblem, we set $\epsilon_{qp} = \min\left(0.01,\; 0.1\left(\frac{1}{2}\|w_t\|^2 + R_{emp}(w_t) - K_t(w_t)\right)\right)$ to allow the accuracy with which we solve the subproblem to vary as the overall optimization progresses.

For feature extraction, we use the popular spatial pyramid bag-of-visual-words model. In our implementation, each window is divided into a 6x6 grid. Within each grid location we extract a 2,048 bin histogram of visual-words. The visual-word histograms are computed by extracting 36-dimensional HOG [3] descriptors from each pixel location, determining which histogram bin the feature is closest to, and adding 1 to that visual-word’s bin count. Next, the visual-word histograms are concatenated to form the feature vector for the sliding window. Finally, we add a constant term which serves as a threshold for detection. Therefore, $\phi$ produces 73,729 dimensional feature vectors.
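The stated dimensionality follows directly from the grid size, the histogram size, and the single constant term described above:

```latex
\[
\underbrace{6 \times 6}_{\text{grid cells}} \times \underbrace{2048}_{\text{bins per cell}} + \underbrace{1}_{\text{constant term}}
= 73{,}728 + 1 = 73{,}729.
\]
```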

The local HOG descriptors are 36 dimensional and are extracted from 10x10 grayscale pixel blocks of four 5x5 pixel cells. Each cell contains 9 unsigned orientation bins. Bilinear interpolation is used for assigning votes to orientation bins but not for spatial bins.

To determine which visual-word a HOG descriptor corresponds to, many researchers compute its distance to an exemplar for each bin and assign the vector to the nearest bin. However, this is computationally expensive, so we use a fast approximate method to determine bin assignment. In particular, we use a random projection based locality sensitive hash. This is accomplished using 11 random planes. A HOG vector is hashed by recording the bit pattern describing which side of each plane it falls on. This 11-bit number then indicates the visual-word’s bin.
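The hashing step can be sketched as follows. The text specifies only that 11 random planes are used and that each bit records a side; drawing the plane normals from a standard Gaussian, passing every plane through the origin, and the bit ordering are assumptions of this sketch, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def make_planes(n_planes: int = 11, dim: int = 36, seed: int = 0) -> List[List[float]]:
    """Random plane normals; Gaussian entries and a fixed seed are assumptions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def hash_descriptor(desc: Sequence[float], planes: Sequence[Sequence[float]]) -> int:
    """Pack one bit per plane (which side the descriptor falls on) into an integer."""
    code = 0
    for plane in planes:
        side = sum(p * d for p, d in zip(plane, desc)) >= 0.0
        code = (code << 1) | int(side)
    return code


planes = make_planes()
descriptor = [0.1 * (k % 9) for k in range(36)]  # stand-in for a 36-dimensional HOG vector
bin_index = hash_descriptor(descriptor, planes)  # integer in [0, 2**11), i.e. one of 2,048 bins
```

Eleven bits give exactly $2^{11} = 2{,}048$ possible codes, which matches the histogram size, and each lookup costs 11 dot products regardless of the number of bins.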


Finally, the sliding window classification can be implemented efficiently using a set of integral images. We also scan the sliding window over every location in an image pyramid which downsamples each layer by 4/5. To decide if two detections overlap for the purposes of non-max suppression we use Equation (1). Similarly, we use Equation (1) to determine if a detection hits a truth box. Finally, all experiments were run on a single desktop workstation.
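The text does not spell out how the integral images are organized, so the following is only a generic sketch of the underlying primitive: after one pass over a 2-D array of per-pixel values, the sum over any axis-aligned box is available from four lookups. The function names and the half-open box convention are assumptions.

```python
from typing import List, Sequence


def integral_image(img: Sequence[Sequence[float]]) -> List[List[float]]:
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii: Sequence[Sequence[float]],
            left: int, top: int, right: int, bottom: int) -> float:
    """Sum over rows top..bottom-1 and columns left..right-1 in constant time."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


ii = integral_image([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
lower_right = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
```

Because each box sum is independent of the box size, scoring every cell of every sliding window position costs the same per window regardless of how large the window is.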

5.1. TU Darmstadt Cows 

We performed 10-fold cross-validation on the TU Darmstadt cows dataset and obtained perfect detection results with no false alarms. The best previous results on this dataset achieve an accuracy of 98.2% at equal error rate. The dataset contains 112 images, each containing a side-view of a cow.

In this test the sliding window was 174 pixels wide and 90 tall. Training on the entire cows dataset finishes in 49 iterations and takes 70 seconds.

5.2. INRIA Pedestrians 

We also tested MMOD on the INRIA pedestrian dataset and followed the testing methodology used by Dalal and Triggs [3]. This dataset has 2,416 cropped images of people for training as well as 912 negative images. For testing it has 1,132 people images and 300 negative images.



Figure 3. The y axis measures the miss rate on people images while 
the x axis shows FPPW obtained when scanning the detector over 
negative images. Our method improves both miss rate and false 
positives per window compared to previous methods on the INRIA 
dataset. 

The negative testing images have an average of 199,834 pixels per image. We scan our detector over an image pyramid which downsamples at a rate of 4/5 and stop when the smallest pyramid layer contains 17,000 pixels. Therefore, MMOD scans approximately 930,000 windows per negative image on the testing data.
We use a sliding window 56 pixels wide and 120 tall. The entire optimization takes 116 minutes and runs for 220 iterations. Our results are compared to previous methods in Figure 3. The detection tradeoff curve shows that our method achieves superior performance even though we use a basic bag-of-visual-word feature set while more recent work has invested heavily in improved feature representations. Therefore, we attribute the improvement to our training procedure.

5.3. FDDB 

Finally, we evaluate our method on the Face Detection Data Set and Benchmark (FDDB) challenge. This challenging dataset contains images of human faces in multiple poses captured in indoor and outdoor settings. To test MMOD, we used it to learn a basic HOG sliding window classifier. Therefore the feature extractor ($\phi$) takes in a window and outputs a HOG vector describing the entire window as was done in Dalal and Triggs’s seminal paper [3]. To illustrate the learned model, the HOG filter resulting from the first FDDB fold is visualized in Figure 4.
Figure 4. The HOG filter learned via MMOD from the first fold of the FDDB dataset. The filters from other folds look nearly identical.

During learning, the parameters were set as follows: $C = 50$, $\epsilon = 0.01C$, $L_{fa} = 1$, and $L_{miss} = 1$. We also upsampled each image by a factor of two so that smaller faces could be detected. Since our HOG filter box is 80x80 pixels in size this upsampling allows us to detect faces that are larger than about 40x40 pixels in size. Additionally, we mirrored the dataset, effectively doubling the number of training images. This leads to training sizes of about 5000 images per fold and our optimizer requires approximately 25 minutes per fold.

To perform detection, this single HOG filter is scanned over the image at each level of an image pyramid and any windows which pass a threshold test are output after non-max suppression is performed. A ROC curve that compares this learned HOG filter against other methods is created by sweeping this threshold and can be seen in Figure 5. To create the ROC curve we followed the FDDB evaluation protocol of performing 10-fold cross-validation and combining the results in a single ROC curve using the provided FDDB evaluation software. Example images with detection outputs are also shown in Figure 6.

Figure 5. A comparison between our HOG filter learned via MMOD and three other techniques, including another HOG filter method learned using traditional means. The MMOD procedure results in a much more accurate HOG filter.

In Figure 5 we see that the HOG filter learned via MMOD substantially outperforms a HOG filter learned with the typical linear SVM “hard negative mining” approach as well as the classic Viola-Jones method [19]. Moreover, our single HOG filter learned via MMOD gives a slightly better accuracy than the complex deformable part model of Zhu [22].

6. Conclusion 

We introduced a new method for learning to detect objects in images. In particular, our method leads to a convex optimization and we provided an efficient algorithm for its solution. We tested our approach on three publicly available datasets, the INRIA person dataset, TU Darmstadt cows, and FDDB, using two feature representations. On all datasets, using MMOD to find the parameters of the detector led to substantial improvements.

Our results on FDDB are most striking as we showed that a single rigid HOG filter can beat a state-of-the-art deformable part model when the HOG filter is learned via MMOD. We attribute our success to the learning method’s ability to make full use of the data. In particular, on FDDB, our method can efficiently make use of all 300 million sliding window positions during training. Moreover, MMOD optimizes the overall accuracy of the entire detector, taking into account information which is typically ignored when training a detector. This includes windows which partially overlap target windows as well as the non-maximum suppression strategy used in the final detector.

Our method currently uses a linear window scoring function. Future research will focus on extending this method to use more-complex scoring functions, possibly by using kernels. The work of Yu and Joachims is a good starting point. Additionally, while our approach was introduced for 2D sliding window problems, it may also be useful for 1D sliding window detection applications, such as those appearing in the speech and natural language processing domains. Finally, to encourage future research, we have made a careful and thoroughly documented implementation of our method available as part of the open source dlib machine learning toolbox.

Figure 6. Example images from the FDDB dataset. The red boxes show the detections from HOG filters learned using MMOD. The HOG filters were not trained on the images shown.